Introduction

             This tutorial demonstrates the application of MARVEL for integrated gene and splicing analysis of RNA-sequencing data generated from droplet-based library preparation methods such as 10x Genomics. Only the most salient options of the functions will be elaborated here. For more details on the function’s options, please launch the help page with ?function_name.
             The dataset used to demonstrate the utility of MARVEL here includes induced pluripotent stem cells (iPSCs) and iPSC-induced cardiomyocytes at day-2, -4, and -10 (Ou et al., 2021). The processed input files required for this tutorial may be downloaded from here: http://datashare.molbiol.ox.ac.uk/public/wwen/MARVEL/Tutorial/Droplet/Data.zip

Load package

# Load MARVEL package
library(MARVEL)

# Load adjunct packages for this tutorial
library(Matrix)
library(data.table)
library(ggplot2)
library(gridExtra)

Input files

             In this section, we will read in the required input files for MARVEL. The formatting requirement for each input file will be explained.

Normalised gene expression

# Matrix
path <- "Data/Gene_SingCellaR/"
file <- "matrix_normalised.mtx"
df.gene.norm <- readMM(paste(path, file, sep=""))

df.gene.norm[df.gene.norm[,1] !=0,][1:5,1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##                                                      
## [1,] 2.1168501 .        0.9731413 0.4139587 0.9874593
## [2,] 0.7056167 .        .         .         .        
## [3,] 2.1168501 3.453436 1.9462826 3.3116695 1.9749185
## [4,] 1.4112334 .        0.4865707 0.4139587 1.4811889
## [5,] 4.2337003 2.302291 4.3791359 3.3116695 1.9749185
# cell metadata
path <- "Data/Gene_SingCellaR/"
file <- "phenoData.txt"
df.gene.norm.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.norm.pheno)
##                       cell.id   donor.id    cell.type
## 1 AAACCTGAGCCGGTAA_SRR9008752 SRR9008752 Cardio day 4
## 2 AAACCTGAGTACCGGA_SRR9008752 SRR9008752 Cardio day 4
## 3 AAACCTGAGTGCTGCC_SRR9008752 SRR9008752 Cardio day 4
## 4 AAACCTGAGTGTACGG_SRR9008752 SRR9008752 Cardio day 4
## 5 AAACCTGAGTGTCCAT_SRR9008752 SRR9008752 Cardio day 4
## 6 AAACCTGCAGCCTGTG_SRR9008752 SRR9008752 Cardio day 4
# feature metadata
path <- "Data/Gene_SingCellaR/"
file <- "featureData.txt"
df.gene.norm.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.norm.feature)
##   gene_short_name
## 1     MIR1302-2HG
## 2         FAM138A
## 3           OR4F5
## 4      AL627309.1
## 5      AL627309.3
## 6      AL627309.2
  • For the sparse matrix, rows represent genes, columns represent cells, and the values represent normalised gene expression values that has not yet been log2-transformed.
  • For the master cell metadata, each row represents one cell. Compulsory column is cell.id and this will be used to annotate the columns of the sparse matrix. All other columns are optional.
  • For the feature metadata, each row represents one gene. Compulsory column is gene_short_name and will be used to annotate the rows of the sparse matrix. All other columns are optional.
  • Here, the normalised gene expression matrix was generated using Cell Ranger and only cell passing QC was retained using SingCellaR (Wang et al., 2022).

Gene counts

# Matrix
path <- "Data/Gene_STARsolo/"
file <- "matrix_counts.mtx"
df.gene.count <- readMM(paste(path, file, sep=""))

df.gene.count[df.gene.count[,1] !=0,][1:5,1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##                  
## [1,] 1  2 1  3  2
## [2,] 1  2 .  .  .
## [3,] 1  1 .  .  .
## [4,] 5  6 8 16  8
## [5,] 3 12 9 10 13
# cell metadata
path <- "Data/Gene_STARsolo/"
file <- "phenoData.txt"
df.gene.count.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.count.pheno)
##                       cell.id
## 1 AAACCTGAGAAGATTC_SRR9008754
## 2 AAACCTGAGAAGGGTA_SRR9008754
## 3 AAACCTGAGCACACAG_SRR9008754
## 4 AAACCTGAGGGCTTGA_SRR9008754
## 5 AAACCTGAGTTGTCGT_SRR9008754
## 6 AAACCTGCAAGTCTGT_SRR9008754
# feature metadata
path <- "Data/Gene_STARsolo/"
file <- "featureData.txt"
df.gene.count.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.gene.count.feature)
##   gene_short_name
## 1     MIR1302-2HG
## 2         FAM138A
## 3           OR4F5
## 4      AL627309.1
## 5      AL627309.3
## 6      AL627309.2
  • For the sparse matrix, rows represent splice junctions, columns represent cells, and the values represent UMI-collapsed, non-normalised, non-log2-transformed gene counts.
  • For the cell metadata, each row represents one cell. Compulsory column is cell.id and this will be used to annotate the columns of the sparse matrix. All other columns are optional.
  • For the feature metadata, each row represents one gene. Compulsory column is gene_short_name and will be used to annotate the rows of the sparse matrix. All other columns are optional.
  • Here, the gene counts were generated using STARsolo (Kaminow et al., 2021).

Splice junction counts

# Matrix
path <- "Data/SJ_STARsolo/"
file <- "matrix_counts.mtx"
df.sj.count <- readMM(paste(path, file, sep=""))

df.sj.count[df.sj.count[,1] !=0,][1:5,1:5]
## 5 x 5 sparse Matrix of class "dgTMatrix"
##                
## [1,] 5 4 7 14 6
## [2,] 1 1 2  1 1
## [3,] 1 . .  . .
## [4,] 1 1 .  . .
## [5,] 1 . .  . .
# cell metadata
path <- "Data/SJ_STARsolo/"
file <- "phenoData.txt"
df.sj.count.pheno <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.sj.count.pheno)
##                       cell.id
## 1 AAACCTGAGAAGATTC_SRR9008754
## 2 AAACCTGAGAAGGGTA_SRR9008754
## 3 AAACCTGAGCACACAG_SRR9008754
## 4 AAACCTGAGGGCTTGA_SRR9008754
## 5 AAACCTGAGTTGTCGT_SRR9008754
## 6 AAACCTGCAAGTCTGT_SRR9008754
# feature metadata
path <- "Data/SJ_STARsolo/"
file <- "featureData.txt"
df.sj.count.feature <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.sj.count.feature)
##        coord.intron
## 1  chr1:14830:14969
## 2 chr1:14830:185490
## 3 chr1:15039:186316
## 4  chr1:16766:16853
## 5  chr1:16766:16857
## 6  chr1:16766:16875
  • For the sparse matrix, rows represent genes, columns represent cells, and the values represent UMI-collapsed, non-normalised, non-log2-transformed splice junction counts.
  • For the cell metadata, each row represents one cell. Compulsory column is cell.id and this will be used to annotate the columns of the sparse matrix. All other columns are optional.
  • For the feature metadata, each row represents one splice junction. Compulsory column is coord.intron and will be used to annotate the rows of the sparse matrix. All other columns are optional.
  • Here, the splice junction counts were generated using STARsolo (Kaminow et al., 2021).

tSNE/UMAP coordinates

path <- "Data/Gene_SingCellaR/"
file <- "dim_red_coordinates_iPSC_CardioDay10.txt"
df.coord <- read.table(paste(path, file, sep=""), sep="\t", header=TRUE, stringsAsFactors=FALSE)

head(df.coord)
##                       cell.id         x          y
## 1 AAACCTGAGAAGATTC_SRR9008754  15.06473 -12.685700
## 2 AAACCTGAGAAGGGTA_SRR9008754  23.00697  10.624021
## 3 AAACCTGAGCACACAG_SRR9008754  12.97456 -18.985449
## 4 AAACCTGAGCTTTGGT_SRR9008753 -20.56841  -1.531026
## 5 AAACCTGAGGGCTTGA_SRR9008754  36.47875  -6.242368
## 6 AAACCTGAGTCGTTTG_SRR9008753 -24.82610 -18.792021
  • This may be coordinates of either UMAP or tSNE. These coordinates will be used to plot the cell’s gene expression or percent spliced-in (PSI) values on the reducted dimension space.
  • Compulsory columns are cell.id, x (x-axis coordinates) and y (y-axis coordinates). All other columns are optional.
  • Here, the tSNE coordinates were generated using SingCellaR (Wang et al., 2022).

Gene transfer file

path <- "Data/GTF/"
file <- "refdata-cellranger-GRCh38-3.0.0.gtf"
gtf <- as.data.frame(fread(paste(path, file, sep=""), sep="\t", header=FALSE, stringsAsFactors=FALSE))
  • This should be the same file previously used for indexing the genome (for STARsolo here), computing gene expression (for Cell Ranger here).

Create MARVEL object

marvel <- CreateMarvelObject.10x(gene.norm.matrix=df.gene.norm,
                                 gene.norm.pheno=df.gene.norm.pheno,
                                 gene.norm.feature=df.gene.norm.feature,
                                 gene.count.matrix=df.gene.count,
                                 gene.count.pheno=df.gene.count.pheno,
                                 gene.count.feature=df.gene.count.feature,
                                 sj.count.matrix=df.sj.count,
                                 sj.count.pheno=df.sj.count.pheno,
                                 sj.count.feature=df.sj.count.feature,
                                 pca=df.coord,
                                 gtf=gtf
                                 )
## [1] "Annotating rows and columns for matrices..."
## [1] "Filling in slots for MARVEL object..."
## [1] "MARVEL object created"

Pre-process data

             This step annotates the gene and splice junction metadata, and subsequently enables us to retain only high-quality splice junctions and genes of particular biotype, e.g., protein-coding genes.

Annotate gene metadata

             This step annotates the genes with their corresponding gene biotype, e.g., protein-coding, lincRNA, pseudogene etc.
marvel <- AnnotateGenes.10x(MarvelObject=marvel)
## [1] "33538 genes found in gene metadata"
## [1] "33514 genes overlapped with GTF and will be subset-ed"
## [1] "Gene type annotated"
## [1] "Gene metadata and normalised gene expression matrix updated"
head(marvel$gene.metadata)
##   gene_short_name      gene_type
## 1     MIR1302-2HG        lincRNA
## 2         FAM138A        lincRNA
## 3           OR4F5 protein_coding
## 4      AL627309.1        lincRNA
## 5      AL627309.3        lincRNA
## 6      AL627309.2      antisense
table(marvel$gene.metadata$gene_type)
## 
##       antisense       IG_C_gene IG_C_pseudogene       IG_D_gene       IG_J_gene 
##            5494              14               9              37              18 
## IG_J_pseudogene       IG_V_gene IG_V_pseudogene         lincRNA  protein_coding 
##               3             144             188            7482           19893 
##       TR_C_gene       TR_D_gene       TR_J_gene TR_J_pseudogene       TR_V_gene 
##               6               4              79               4             106 
## TR_V_pseudogene 
##              33
             This step creates and then annotates the splice junction metadata. For each splice junction, the start and end exons are annotated with their corresponding gene names.

Annotate junction metadata

marvel <- AnnotateSJ.10x(MarvelObject=marvel)
## [1] "Creating splice junction metadata..."
## [1] "Parsing GTF..."
## [1] "Matching gene names with SJ start coordinates in GTF..."
## [1] "Matching gene names with SJ end coordinates in GTF..."
## [1] "Annotating splice junctions..."
## [1] "All SJ successfully annotated"
head(marvel$sj.metadata)
##        coord.intron gene_short_name.start gene_short_name.end
## 1  chr1:14830:14969                  <NA>                <NA>
## 2 chr1:14830:185490                  <NA>                <NA>
## 3 chr1:15039:186316                  <NA>                <NA>
## 4  chr1:16766:16853                  <NA>                <NA>
## 5  chr1:16766:16857                  <NA>                <NA>
## 6  chr1:16766:16875                  <NA>                <NA>
##                               sj.type
## 1 start_unknown.gene|end_unknown.gene
## 2 start_unknown.gene|end_unknown.gene
## 3 start_unknown.gene|end_unknown.gene
## 4 start_unknown.gene|end_unknown.gene
## 5 start_unknown.gene|end_unknown.gene
## 6 start_unknown.gene|end_unknown.gene

Validate junctions

             Only splice junctions whose start and end mapped to same gene are retained.
marvel <- ValidateSJ.10x(MarvelObject=marvel)
## [1] "239458 annotated and uniquely-mapped SJ identified and retained"
## [1] "957704 unannotated SJ identified and removed"
## [1] "11617 multi-mapping SJ identified and removed"
## [1] "5379 unannotated and multi-mapping SJ identified and removed"
## [1] "SJ metadata and SJ count matrix updated"

Subset CDS genes

             Moving forward, we will only retain genes with coding sequence (CDS), i.e., protein-coding genes.
marvel <- FilterGenes.10x(MarvelObject=marvel,
                          gene.type="protein_coding"
                          )
## [1] "19893 of 33514 genes met filtering criteria and retained"
## [1] "227264 of 239458 SJ met filtering criteria and retained"
## [1] "Normalised gene and SJ metadata and matrix updated"

Pre-flight check

             This step ensures that our data is ready for further downstream analysis including differential expression analysis, functional annotation, and candidate gene analysis.
             To this end, we will have to make sure the columns of the matrices align with the cell IDs of the sample metadata and the rows of the matrices align with the feature metadata. Finally, the columns across all matrices should align with one another.
marvel <- CheckAlignment.10x(MarvelObject=marvel)
## [1] "Matching gene list in normalised and raw (count) gene matrices..."
## [1] "19893 genes found in normalised gene matrix"
## [1] "33538 genes found in raw (count) gene matrix"
## [1] "19893 overlapping genes found and retained"
## [1] "Matching cells across normalised gene, raw (count) gene, raw (count) SJ matrices..."
## [1] "32056 cells found in normalised gene matrix"
## [1] "32056 cells found in raw (count) gene matrix"
## [1] "32056 cells found in raw (count) SJ matrix"
## [1] "32056 overlapping cells found and retained"
## [1] "Checking column (sample) alignment..."
## [1] "Sample metadata and normalised gene matrix column names MATCHED"
## [1] "Normalised and raw (count) gene matrix column names MATCHED"
## [1] "Raw (Count) gene and SJ matrix column names MATCHED"
## [1] "Checking row (gene names/SJ coordinates) alignment..."
## [1] "Gene metadata and normalised gene matrix row names MATCHED"
## [1] "Normalised and raw (count) gene matrix row names MATCHED"
## [1] "SJ metadata and raw (count) SJ matrix row names MATCHED"
## [1] "32056 cells and 16071 genes consisting of 227264 splice junctions included for further analysis"
             Our data is ready for downstream analysis when only MATCHED flags are reported. If any NOT MATCHED flags are reported, please double-check the input file requirements.

Save R object

             Given the time taken for data pre-processing, it’ll be a good idea to save the MARVEL object at this step. The object may be loaded whenever the user begins any one of the downstream analysis.
# Save object
path <- "Data/R object/"
file <- "MARVEL.RData"
save(marvel, file=paste(path, file, sep=""))

# Load object
# path <- "Data/R Object/"
# file <- "MARVEL.RData"
# load(paste(path, file, sep=""))

Explore expression

             Due to the huge number of splice junctions detected, it will not be time-effective to include all splice junctions for differential splicing analysis. Therefore, we will only include splice junctions expressed in at least a user-defined proportion of cells for differential splicing analysis. To determine this threshold, we will explore the number of cells expressing a given gene and splice junction.
             We will explore the gene and splice junction expression rate in iPSCs and day-10 cardiomyocytes as we will be performing differential splicing analysis on these two cell populations.

Gene expression rate

# Define cell groups
    # Retrieve sample metadata
    sample.metadata <- marvel$sample.metadata
    
    # Group 1 (reference)
    index <- which(sample.metadata$cell.type=="iPSC")
    cell.ids.1 <- sample.metadata[index, "cell.id"]
    length(cell.ids.1)
## [1] 11244
    # Group 2
    index <- which(sample.metadata$cell.type=="Cardio day 10")
    cell.ids.2 <- sample.metadata[index, "cell.id"]
    length(cell.ids.2)
## [1] 5937
# Explore % of cells expressing genes
marvel <- PlotPctExprCells.Genes.10x(MarvelObject=marvel,
                                     cell.group.g1=cell.ids.1,
                                     cell.group.g2=cell.ids.2,
                                     min.pct.cells=5
                                     )

marvel$pct.cells.expr$Gene$Plot

head(marvel$pct.cells.expr$Gene$Data)
##       cell.group gene_short_name n.cells.total n.cells.expr pct.cells.expr
## 2  cell.group.g1           NOC2L         11244         5590          49.72
## 5  cell.group.g1            HES4         11244          687           6.11
## 6  cell.group.g1           ISG15         11244         7172          63.79
## 7  cell.group.g1            AGRN         11244         2381          21.18
## 12 cell.group.g1            SDF4         11244         4440          39.49
## 13 cell.group.g1        C1QTNF12         11244          565           5.02
  • min.pct.cells option. The minimum percentage of cells expressing a given gene for that gene to be included for expression exploration analysis here.
  • Here, we observe majority of genes to be expressed in ~10% of cells.

SJ expression rate

marvel <- PlotPctExprCells.SJ.10x(MarvelObject=marvel,
                                  cell.group.g1=cell.ids.1,
                                  cell.group.g2=cell.ids.2,
                                  min.pct.cells.genes=5,
                                  min.pct.cells.sj=5,
                                  downsample=TRUE,
                                  downsample.pct.sj=10
                                  )

marvel$pct.cells.expr$SJ$Plot

head(marvel$pct.cells.expr$SJ$Data)
##       cell.group         coord.intron n.cells.total n.cells.expr pct.cells.expr
## 25 cell.group.g1 chr1:1629705:1630291         11244         1227          10.91
## 26 cell.group.g1 chr1:1634329:1634450         11244          832           7.40
## 32 cell.group.g1 chr1:1721712:1722707         11244          867           7.71
## 42 cell.group.g1 chr1:2400936:2402206         11244         1066           9.48
## 47 cell.group.g1 chr1:3775484:3775794         11244         1467          13.05
## 79 cell.group.g1 chr1:8868058:8870451         11244          674           5.99
  • min.pct.cells.genes option. The minimum percentage of cells expressing a given gene for that gene to be included for expression exploration analysis here. This threshold may be inferred from after running the PlotPctExprCells.Genes.10x function earlier.
  • min.pct.cells.sj option. The minimum percentage of cells expressing a given splice junction for that splice junction to be included for expression exploration analysis.
  • downsample option. Down-sample the number of splice junctions prior to expression exploration analysis.
  • downsample.pct.sj option. When downsample set to TRUE, the percentage of splice junctions to keep for expression exploration analysis.
  • Here, we observe majority of splice junctions to be expressed in ~10% of cells.

Differential analysis

             Differential analysis is the cornerstone of RNA-sequencing analysis. This is the first step to identify candidate genes and isoforms for downstream experimental validation.
             For differential splicing analysis, MARVEL utilises the permutation-based approach for comparing the PSI values between two pseudo-bulk samples (Efremova et al., 2020).

             For differential gene expression analysis, MARVEL utilises the Wilcoxon rank-sum test to compare the normalised, log2-transformed gene expression values between the single-cells that constitute the two pseudo-bulk samples (Satija et al., 2015)

Differetial splicing analysis

marvel <- CompareValues.SJ.10x(MarvelObject=marvel,
                               cell.group.g1=cell.ids.1,
                               cell.group.g2=cell.ids.2,
                               min.pct.cells.genes=10,
                               min.pct.cells.sj=10,
                               seed=1,
                               n.iterations=100,
                               downsample=TRUE,
                               show.progress=FALSE
                               )
## [1] "5937 cells from Group 1 and 5937 cells from Group 2 included"
## [1] "7322 genes expressed in cell group 1"
## [1] "9751 genes expressed in cell group 2"
## [1] "6928 genes expressed in BOTH cell group and retained"
## [1] "3784 SJ expressed in cell group 1"
## [1] "6866 SJ expressed in cell group 2"
## [1] "7087 SJ expressed in EITHER cell groups and retained"
## [1] "Total of 7087 SJ from 2765 genes included for DE analysis"
## [1] "Computing PSI for cell group 1..."
## [1] "Computing PSI for cell group 2..."
## [1] "Creating null distributions..."
## [1] "Computing P values..."
  • cell.group.g1 option. Cell IDs corresponding to the reference cell group.
  • cell.group.g2 option. Cell IDs corresponding to the non-reference cell group.
  • min.pct.cells.genes option. The minimum number of cells expressing a given gene for the gene to be included for DE analysis. This threshold may be determined from running the PlotPctExprCells.Genes.10x function earlier.
  • min.pct.cells.sj option. The minimum number of cells expressing a given splice junction for the splice junction to be included for DE analysis. This threshold may be determined from running the PlotPctExprCells.SJ.10x function earlier.
  • seed option. To ensure the null distributions and dowm-sampling process are always reproducible.
  • n.iterations option. The number of permutations to perform when building the null distribution.
  • downsample option. If set to TRUE, the number of cells of each cell group will be down-sampled. The number of cells to down-sample to will be based on the size of smaller cell group.
  • show.progress option. For brevity of this tutorial, we did not track the DE progress here. But users are advised to set this option to TRUE on their own devices.

Differetial gene analysis

marvel <- CompareValues.Genes.10x(MarvelObject=marvel,
                                  show.progress=FALSE
                                  )
## [1] "Performing Wilcox rank sum test..."
head(marvel$DE$SJ$Table)
##           coord.intron gene_short_name n.cells.total.g1 n.cells.expr.sj.g1
## 1 chr1:1311925:1312017          INTS11             5937                346
## 2 chr1:1373903:1373999        AURKAIP1             5937               5858
## 3 chr1:1390564:1390765           CCNL2             5937                 60
## 4 chr1:1402257:1405808          MRPL20             5937               2955
## 5 chr1:1405887:1406908          MRPL20             5937               1069
## 6 chr1:1629705:1630291            MIB2             5937                642
##   pct.cells.expr.sj.g1 n.cells.expr.gene.g1 pct.cells.expr.gene.g1
## 1                 5.83                 2848                  47.97
## 2                98.67                 5897                  99.33
## 3                 1.01                 1278                  21.53
## 4                49.77                 5845                  98.45
## 5                18.01                 5845                  98.45
## 6                10.81                 3045                  51.29
##   sj.count.total.g1 gene.count.total.g1 psi.g1 n.cells.total.g2
## 1               362                4078   8.88             5937
## 2             40782               49806  81.88             5937
## 3                60                1500   4.00             5937
## 4              4425               36002  12.29             5937
## 5              1199               36002   3.33             5937
## 6               708                4716  15.01             5937
##   n.cells.expr.sj.g2 pct.cells.expr.sj.g2 n.cells.expr.gene.g2
## 1                972                16.37                 3423
## 2               5798                97.66                 5858
## 3                724                12.19                 4240
## 4               3516                59.22                 5859
## 5               1249                21.04                 5859
## 6                526                 8.86                 2252
##   pct.cells.expr.gene.g2 sj.count.total.g2 gene.count.total.g2 psi.g2
## 1                  57.66              1067                5478  19.48
## 2                  98.67             32201               38876  82.83
## 3                  71.42               789                8516   9.26
## 4                  98.69              5754               35035  16.42
## 5                  98.69              1445               35035   4.12
## 6                  37.93               547                2891  18.92
##       log2fc delta pval mean.expr.gene.norm.g1.g2 n.cells.total.norm.g1
## 1 1.05163277 10.60    0                 0.4152131                  5937
## 2 0.01644263  0.95    0                 2.1389883                  5937
## 3 1.03703073  5.26    0                 0.3803775                  5937
## 4 0.39040352  4.13    0                 1.9015034                  5937
## 5 0.24177679  0.79    0                 1.9015034                  5937
## 6 0.31524434  3.91    0                 0.3492695                  5937
##   n.cells.expr.gene.norm.g1 pct.cells.expr.gene.norm.g1 mean.expr.gene.norm.g1
## 1                      2848                       47.97              0.4209907
## 2                      5897                       99.33              2.5013445
## 3                      1278                       21.53              0.1613387
## 4                      5845                       98.45              2.1086064
## 5                      5845                       98.45              2.1086064
## 6                      3045                       51.29              0.4691008
##   n.cells.total.norm.g2 n.cells.expr.gene.norm.g2 pct.cells.expr.gene.norm.g2
## 1                  5937                      3423                       57.66
## 2                  5937                      5858                       98.67
## 3                  5937                      4240                       71.42
## 4                  5937                      5859                       98.69
## 5                  5937                      5859                       98.69
## 6                  5937                      2252                       37.93
##   mean.expr.gene.norm.g2 log2fc.gene.norm pval.gene.norm pval.adj.gene.norm
## 1              0.4094355      -0.01155529   4.132942e-01       4.209056e-01
## 2              1.7766321      -0.72471240   0.000000e+00       0.000000e+00
## 3              0.5994163       0.43807758   0.000000e+00       0.000000e+00
## 4              1.6944003      -0.41420606   0.000000e+00       0.000000e+00
## 5              1.6944003      -0.41420606   0.000000e+00       0.000000e+00
## 6              0.2294382      -0.23966252  4.513622e-144      9.004448e-144
  • show.progress option. For brevity of this tutorial, we did not track the DE progress here. But users are advised to set this option to TRUE on their own devices.
  • Please note that MARVEL will only perform differential gene expression analysis on genes included for differential splicing analysis. These genes constitute the splice junctions included for analysis using the CompareValues.SJ.10x function earlier.

Volcano plot: Splicing

marvel <- PlotDEValues.SJ.10x(MarvelObject=marvel,
                              pval=0.05,
                              delta=5,
                              min.gene.norm=1.0,
                              anno=FALSE
                              )

marvel$DE$SJ$VolcanoPlot$SJ$Plot

head(marvel$DE$SJ$VolcanoPlot$SJ$Data[,c("coord.intron", "gene_short_name", "sig")])
##            coord.intron gene_short_name  sig
## 2  chr1:1373903:1373999        AURKAIP1 n.s.
## 4  chr1:1402257:1405808          MRPL20 n.s.
## 5  chr1:1405887:1406908          MRPL20 n.s.
## 15 chr1:2189782:2193638          FAAP20   up
## 18 chr1:6197757:6199561           RPL22 n.s.
## 19 chr1:6820251:6825091          CAMTA1 n.s.
  • pval option. The p-value, below which, the splice junction is considered to be differentially spliced.
  • delta option. The absolute difference in mean PSI values between the two cell populations, above which, the splice junction is considered to be differentially spliced.
  • min.gene.norm option. The mean normalised gene expression value across all cells from both populations, above which, the corresponding splice junction is considered to be expressed.

Gene-splicing dynamics

             MARVEL’s integrated differential gene and splicing analysis enables us to investigate how gene expression changes relative to splice junction changes when iPSCs differentiate into day-10 cardiomyocytes. The gene-splicing dynamics may be classified into four categories, namely coordinated, opposing, isoform-switching, and complex.
  • Coordinated gene-splicing relationship refers to the change in mean gene expression is in the same direction with the corresponding splicing junction(s).
  • Opposing gene-splicing relationship refers to the change in mean gene expression is in the opposite direction to the corresponding splicing junction(s).
  • Isoform-switching refers to genes that are differentially spliced without being differentially expressed.
  • Complex gene-splicing relationship refers to genes with both coordinated and opposing relationships with the corresponding splicing junctions.
              Here, we will explore the gene-splicing dynamics of genes that are differentially spliced between iPSCs and day-10 cardiomyocytes. Representative examples of each dynamic will also be shown. This section also utilises the ad hoc plotting functions PlotValues.PCA.CellGroup.10x, PlotValues.PCA.Gene.10x, PlotValues.PCA.PSI.10x for plotting cell groups, PSI, and gene expression, respectively, on the reduced dimension space.
              Please not users are required to run the CompareValues.SJ.10x and CompareValues.Genes.10x functions (refer to “Differential splicing analysis” section) before proceeding with this section.

Assign dynamics

marvel <- IsoSwitch.10x(MarvelObject=marvel,
                        pval.sj=0.05,
                        delta.sj=5,
                        min.gene.norm=1.0,
                        pval.adj.gene=0.05,
                        log2fc.gene=0.5
                        )

marvel$SJ.Gene.Cor$Proportion$Plot

marvel$SJ.Gene.Cor$Proportion$Table
##   sj.gene.cor freq       pct
## 2 Coordinated  101 18.738404
## 4    Opposing   83 15.398887
## 3  Iso-Switch  317 58.812616
## 1     Complex   38  7.050093
head(marvel$SJ.Gene.Cor$Data[,c("coord.intron", "gene_short_name", "cor.complete")])
##             coord.intron gene_short_name cor.complete
## 1   chr1:2189782:2193638          FAAP20   Iso-Switch
## 2 chr1:11054886:11054961             SRM     Opposing
## 3 chr1:11674817:11675081          MAD2L2     Opposing
## 4 chr1:20652529:20652620           DDOST   Iso-Switch
## 5 chr1:22086867:22091427           CDC42  Coordinated
## 6 chr1:23313703:23318482          HNRNPR  Coordinated
  • pval.sj option. The p-value, below which, the splice junction is considered to be differentially spliced.
  • delta.sj option. The absolute difference in mean PSI values between the two cell populations, above which, the splice junction is considered to be differentially spliced.
  • min.gene.norm option. The mean normalised gene expression value across all cells from both populations, above which, the corresponding splice junction is considered to be expressed.
  • pval.adj.gene option. The p-value, below which, the gene is considered to be differentially expressed.
  • log2fc.gene option. The absolute log2 fold change in gene expression, above which, the gene is considered to be differentially expressed.
  • Here, we observe majority of differentially spliced genes underwent isoform-switching followed by coordinated, opposing, and then complex gene expression changes relative to splicing changes.
  • Note that majority of differentially spliced genes may not be inferred directly from differential gene expression analysis alone. For example, only in coordinated gene-splicing relationship that the gene and splicing changes between iPSCs and day-10 cardiomyocytes are in the same direction. But this category only constitute around one-fifth of gene-splicing relationships.

Coordinated

# Define cell groups
    # Retrieve sample metadata
    sample.metadata <- marvel$sample.metadata
    
    # iPSC
    index <- which(sample.metadata$cell.type=="iPSC")
    cell.ids.1 <- sample.metadata[index, "cell.id"]
    length(cell.ids.1)
## [1] 11244
    # Cardio day 10
    index <- which(sample.metadata$cell.type=="Cardio day 10")
    cell.ids.2 <- sample.metadata[index, "cell.id"]
    length(cell.ids.2)
## [1] 5937
    # Save into list
    cell.group.list <- list("iPSC"=cell.ids.1,
                            "Cardio d10"=cell.ids.2
                            )
    
# Plot cell groups
marvel <- PlotValues.PCA.CellGroup.10x(MarvelObject=marvel,
                                       cell.group.list=cell.group.list,
                                       legendtitle="Cell group",
                                       type="tsne"
                                       )
## [1] "All cells defined with coordinates found"
plot_group <- marvel$adhocPlot$PCA$CellGroup

# Plot gene expression
marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                  gene_short_name="VIM",
                                  color.gradient=c("grey","cyan","green","yellow","red"),
                                  type="tsne"
                                  )


plot_gene <- marvel$adhocPlot$PCA$Gene

# Plot PSI
marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                 coord.intron="chr10:17235390:17235845",
                                 min.gene.count=3,
                                 log2.transform=FALSE,
                                 color.gradient=c("grey","cyan","green","yellow","red"),
                                 type="tsne"
                                 )
 
plot_sj <- marvel$adhocPlot$PCA$PSI

# Arrange and view plots
grid.arrange(plot_group, plot_gene, plot_sj, nrow=1)

  • For VIM, both gene expression (middle) and the corresponding splice junction (right) shown here are up-regulated in day-10 cardiomyocytes relative to iPSCs.
  • Note gene expression represented by log2(expression) scale whereas splicing rate represented by PSI scale.

Opposing

# Plot gene expression
marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                  gene_short_name="UQCRH",
                                  color.gradient=c("grey","cyan","green","yellow","red"),
                                  type="tsne"
                                  )


plot_gene <- marvel$adhocPlot$PCA$Gene

# Plot PSI
marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                 coord.intron="chr1:46310317:46316551",
                                 min.gene.count=3,
                                 log2.transform=FALSE,
                                 color.gradient=c("grey","cyan","green","yellow","red"),
                                 type="tsne"
                                 )
 
plot_sj <- marvel$adhocPlot$PCA$PSI

# Arrange and view plots
grid.arrange(plot_group, plot_gene, plot_sj, nrow=1)

  • For UQRCH, the gene expression (middle) is up-regulated in iPSCs relative to day-10 cardiomyocytes.
  • On the other hand, the corresponding splice junction (right) shown here is up-regulated in day-10 cardiomyocytes relative to iPSCs.
  • Note gene expression represented by log2(expression) scale whereas splicing rate represented by PSI scale.

Iso-switch

# Plot gene expression
marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                  gene_short_name="RBM39",
                                  color.gradient=c("grey","cyan","green","yellow","red"),
                                  type="tsne"
                                  )


plot_gene <- marvel$adhocPlot$PCA$Gene

# Plot PSI: SJ-1
marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                 coord.intron="chr20:35740888:35741940",
                                 min.gene.count=3,
                                 log2.transform=FALSE,
                                 color.gradient=c("grey","cyan","green","yellow","red"),
                                 type="tsne"
                                 )
 
plot_sj_1 <- marvel$adhocPlot$PCA$PSI

# Plot PSI: SJ-2
marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                 coord.intron="chr20:35739018:35740823",
                                 min.gene.count=3,
                                 log2.transform=FALSE,
                                 color.gradient=c("grey","cyan","green","yellow","red"),
                                 type="tsne"
                                 )
 
plot_sj_2 <- marvel$adhocPlot$PCA$PSI

# Arrange and view plots
grid.arrange(plot_gene, plot_sj_1, plot_sj_2, nrow=1)

  • For RBM39, the gene (left) is not differentially expressed between iPSCs and day-10 cardiomyocytes.
  • On the other hand, both corresponding splice junctions (middle and right) are up-regulated in iPSCs relative to day-10 cardiomyocytes.
  • Note gene expression represented by log2(expression) scale whereas splicing rate represented by PSI scale.

Complex

# Gene 1
  # Plot gene expression
  marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                    gene_short_name="TPM1",
                                    color.gradient=c("grey","cyan","green","yellow","red"),
                                    type="tsne"
                                    )

  plot.1_gene <- marvel$adhocPlot$PCA$Gene
  
  # Plot PSI: SJ-1
  marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                   coord.intron="chr15:63064143:63065895",
                                   min.gene.count=3,
                                   log2.transform=FALSE,
                                   color.gradient=c("grey","cyan","green","yellow","red"),
                                   type="tsne"
                                   )
   
  plot.1_sj_1 <- marvel$adhocPlot$PCA$PSI
  
  # Plot PSI: SJ-2
  marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                   coord.intron="chr15:63044153:63056984",
                                   min.gene.count=3,
                                   log2.transform=FALSE,
                                   color.gradient=c("grey","cyan","green","yellow","red"),
                                   type="tsne"
                                   )
   
  plot.1_sj_2 <- marvel$adhocPlot$PCA$PSI

# Gene 2
  # Plot gene expression
  marvel <- PlotValues.PCA.Gene.10x(MarvelObject=marvel,
                                    gene_short_name="TPM2",
                                    color.gradient=c("grey","cyan","green","yellow","red"),
                                    type="tsne"
                                    )
  
  
  plot.2_gene <- marvel$adhocPlot$PCA$Gene
  
  # Plot PSI: SJ-1
  marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                   coord.intron="chr9:35684808:35685268",
                                   min.gene.count=3,
                                   log2.transform=FALSE,
                                   color.gradient=c("grey","cyan","green","yellow","red"),
                                   type="tsne"
                                   )
   
  plot.2_sj_1 <- marvel$adhocPlot$PCA$PSI
  
  # Plot PSI: SJ-2
  marvel <- PlotValues.PCA.PSI.10x(MarvelObject=marvel,
                                   coord.intron="chr9:35684551:35685063",
                                   min.gene.count=3,
                                   log2.transform=FALSE,
                                   color.gradient=c("grey","cyan","green","yellow","red"),
                                   type="tsne"
                                   )
   
  plot.2_sj_2 <- marvel$adhocPlot$PCA$PSI
  
# Arrange and view plots
grid.arrange(plot.1_gene, plot.1_sj_1, plot.1_sj_2, 
             plot.2_gene, plot.2_sj_1, plot.2_sj_2,
             nrow=2
             )

  • For both TPM1 and TPM2 genes, their gene expressions (left) are up-regulated in day-10 cardiomyocytes relative to iPSCs.
  • Similar to gene expression profile, one of the splice junctions (middle) for these genes are up-regulated in day-10 cardiomyocytes relative to iPSCs.
  • On the other hand, the other splice junction (right) for these genes are down-regulated in day-10 cardiomyocytes relative to iPSCs.
  • Note gene expression represented by log2(expression) scale whereas splicing rate represented by PSI scale.

Gene ontology analysis

             Gene ontology analysis or pathway enrichment analysis categorises the differentially spliced genes between iPSCs and day-10 cardiomyocytes into biological pathways. This may identify sets of genes with similar function or belong to similar biological pathways that are concurrently spliced.
marvel <- BioPathways.10x(MarvelObject=marvel,
                          pval=0.05,
                          delta=5,
                          min.gene.norm=1.0,
                          method.adjust="fdr",
                          species="human"
                          )
## [1] "539 unique genes identified for GO analysis"
head(marvel$DE$BioPathways$Table)
##                    ID                                   Description GeneRatio
## GO:0006413 GO:0006413                      translational initiation    81/514
## GO:0006402 GO:0006402                        mRNA catabolic process   104/514
## GO:0006613 GO:0006613 cotranslational protein targeting to membrane    64/514
## GO:0019080 GO:0019080                         viral gene expression    74/514
## GO:0006119 GO:0006119                     oxidative phosphorylation    48/514
## GO:0022613 GO:0022613          ribonucleoprotein complex biogenesis    69/514
##              BgRatio enrichment       pvalue     p.adjust       qvalue
## GO:0006413 192/18866  15.484618 7.365368e-76 1.521578e-72 1.308689e-72
## GO:0006402 376/18866  10.152248 7.667309e-76 1.521578e-72 1.308689e-72
## GO:0006613 109/18866  21.551137 4.346461e-72 4.312776e-69 3.709361e-69
## GO:0019080 195/18866  13.928804 4.244284e-65 2.105695e-62 1.811081e-62
## GO:0006119 149/18866  11.824198 2.338011e-38 5.799729e-36 4.988270e-36
## GO:0022613 482/18866   5.254347 2.570165e-30 5.667214e-28 4.874296e-28
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       geneID
## GO:0006413                                                                                                                                   RPL11/YTHDF2/RPS8/RPL5/RPS27/TPR/CDC123/RPS24/RPLP2/EIF3F/RPS13/EIF3M/POLR2G/RPS3/RPS25/EIF4B/RPS26/RPL6/RPLP0/RPS29/EIF5/EIF2AK4/EIF3J/RPL4/RPLP1/RPS2/RPS15A/RPL13/RPL23/RPL19/EIF1/RPL38/RPL36/RPS28/EIF3G/RPL18A/UBA52/EIF3K/RPS16/RPS19/RPL18/RPL13A/RPS11/RPS9/RPL28/RPS5/RPS7/RPS27A/EIF5B/RPL31/RPL37A/EIF2S2/EIF3L/RPL3/RPSA/RPL14/RPL24/RPL35A/RPL9/RPL34/RPS3A/RPL37/RPS23/PAIP2/NPM1/RPS18/RPL10A/RPS12/HSPB1/RPS20/RPL7/RPL30/EIF3E/EIF3H/RPL8/RPS6/RPL12/RPL7A/RPS4X/RPL10/RPS4Y1
## GO:0006402 HNRNPR/RPL11/YTHDF2/PSMB2/THRAP3/YBX1/RPS8/SERBP1/RPL5/PSMA5/CSDE1/RBM8A/PSMD4/PSMB4/RPS27/HNRNPU/VIM/RPS24/RPLP2/RPS13/PSMC3/POLR2G/RPS3/RPS25/YBX3/RPS26/RPL6/RPLP0/HNRNPC/PSME2/RPS29/PSMA3/RPL4/RPLP1/PSMA4/RPS2/RPS15A/FUS/RPL13/PSMB6/PSMB3/RPL23/RPL19/PSMC5/DDX5/RPL38/CIRBP/RPL36/RPS28/HNRNPM/RPL18A/UBA52/PSMD8/RPS16/RPS19/RPL18/RPL13A/RPS11/RPS9/RPL28/RPS5/RPS7/RPS27A/RPL31/SSB/RPL37A/XRN2/YWHAB/RPL3/LSM3/RPSA/RPL14/RPL24/DHX36/FXR1/RPL35A/RPL9/RPL34/LSM6/RPS3A/RPL37/RPS23/NPM1/RPS18/RPL10A/SYNCRIP/RPS12/PSMB1/HSPB1/RPS20/RPL7/RPL30/YWHAZ/EIF3E/RPL8/RPS6/PSMB7/RPL12/SET/RPL7A/NBDY/RPS4X/RPL10/RPS4Y1
## GO:0006613                                                                                                                                                                                                                                          RPL11/RPS8/RPL5/RPS27/RPS24/RPLP2/RPS13/RPS3/RPS25/RPS26/RPL6/RPLP0/RPS29/RPL4/RPLP1/RPS2/RPS15A/RPL13/RPL23/RPL19/RPL38/RPL36/RPS28/RPL18A/UBA52/RPS16/RPS19/RPL18/RPL13A/RPS11/RPS9/RPL28/RPS5/RPS7/RPS27A/RPL31/RPL37A/RPL3/RPSA/RPL14/RPL24/SSR3/SEC62/RPL35A/RPL9/SRP72/RPL34/RPS3A/RPL37/RPS23/RPS18/RPL10A/RPS12/RPS20/RPL7/RPL30/RPL8/RPS6/SEC61B/RPL12/RPL7A/RPS4X/RPL10/RPS4Y1
## GO:0019080                                                                                                                                                                         RPL11/RPS8/RPL5/RPS27/TPR/NUCKS1/RPS24/IFITM3/RPLP2/EIF3F/RPS13/PSMC3/POLR2G/RPS3/RPS25/PCBP2/RPS26/RPL6/RPLP0/RPS29/EIF2AK4/RPL4/RPLP1/RPS2/RPS15A/RPL13/RPL23/RPL19/RPL38/RPL36/RPS28/EIF3G/SMARCA4/RPL18A/UBA52/RPS16/RPS19/RPL18/RPL13A/RPS11/RPS9/RPL28/RPS5/RPS7/RPS27A/RPL31/SSB/RPL37A/EIF3L/POLR2F/RPL3/RPSA/RPL14/RPL24/RPL35A/RPL9/RPL34/RPS3A/RPL37/RPS23/RPS18/RPL10A/RPS12/POLR2J/RPS20/RPL7/RPL30/RPL8/RPS6/RPL12/RPL7A/RPS4X/RPL10/RPS4Y1
## GO:0006119                                                                                                                                                                                                                                                                                UQCRH/NDUFS2/ATP5F1C/NDUFS3/NDUFS8/ATP5MC2/ATP5F1B/NDUFA12/DNAJC15/NDUFB1/COX5A/NDUFB10/UQCRC2/NDUFAB1/COX4I1/ATP5PD/NDUFV2/ATP5F1A/ATP5F1D/NDUFS7/NDUFA3/ATP5MC3/NDUFB3/ATP5F1E/ATP5PF/NDUFA6/UQCRC1/NDUFB4/ATP5ME/NDUFC1/NDUFS6/NDUFS4/COX7C/UQCRQ/COX7A2/NDUFA4/CYCS/ATP5MF/NDUFA5/NDUFB2/COX6C/NDUFB9/CYC1/NDUFB6/STOML2/NDUFA8/NDUFB11/NDUFA1
## GO:0022613                                                                                                                                                                                                           RPL11/RPS8/RPL5/RPS27/SNRPE/RPS24/NPM3/EIF3F/EIF3M/TRMT112/MRPL11/EIF4B/PA2G4/SNRPF/RPL6/RPLP0/GTF3A/SRSF5/EIF5/EIF3J/RPS2/RSL1D1/RPL38/SNRPD1/RPS28/EIF3G/EIF3K/RPS16/FBL/RPS19/SNRPD2/NOP53/PIH1D1/RPL13A/RPS9/RPS5/RPS7/CEBPZ/SNRPG/DDX18/SF3B1/SNRPB/NOP56/XRN2/EIF2S2/EIF3L/DDX17/RPL3/RPSA/RPL14/RPL24/RPL35A/LSM6/NSA2/MRPL22/NPM1/RPL26L1/SNRPC/RPL10A/PRKDC/RPL7/EIF3E/EIF3H/PSIP1/RPS6/RPL12/RPL7A/PIN4/RPL10
##            Count
## GO:0006413    81
## GO:0006402   104
## GO:0006613    64
## GO:0019080    74
## GO:0006119    48
## GO:0022613    69
  • pval option. The p-value, below which, the splice junction is considered to be differentially spliced.
  • delta option. The absolute difference in mean PSI values between the two cell populations, above which, the splice junction is considered to be differentially spliced.
  • min.gene.norm option. The mean normalised gene expression value across all cells from both populations, above which, the corresponding splice junction is considered to be expressed.
  • species option. MARVEL also supports GO analysis of "mouse".
  • Alternative to using the pval, delta, and min.gene.norm options for specifying differentially spliced genes for GO analysis, users may input their own custom genes for GO analysis using the custom.genes option.
# Plot pathways related to cardiomyocyte development
go.terms <- c("ATP metabolic process",
              "regulation of stem cell differentiation",
              "Wnt signaling pathway, planar cell polarity pathway",
              "non-canonical Wnt signaling pathway",
              "muscle filament sliding",
              "actin-myosin filament sliding",
              "regulation of ATPase activity",
              "regulation of fibroblast proliferation",
              "actomyosin structure organization",
              "myofibril assembly",
              "negative regulation of calcium-mediated signaling"
              )

marvel <- BioPathways.Plot.10x(MarvelObject=marvel,
                               go.terms,
                               y.label.size=8,
                               offset=0.5
                               )

marvel$DE$BioPathways$Plot

Candidate gene analysis

             Users may have specific candidate genes of interest to investigate in the single-cell RNA-sequencing dataset. These candidate genes may have been identified from initial differential gene expression analysis of the same dataset or from an earlier discovery dataset, e.g. bulk RNA-sequencing dataset, or that reported from the literature.
             Here, we will illustrate several related functions for investigating the gene-splicing relationship between a selected gene and its corresponding splice junctions. This may reveal candidate isoforms for downstream functional validation.
             We have chosen TPM2 for this illustration as we have seen earlier that this gene has a complex relationship with its splice junctions (see “Gene-splicing dynamics” section of this tutorial). Therefore, this gene will be a good example for demonstrating the intricate relationship betweeh splice junction usage and gene expression profile.

Define cell groups

             Our earlier differential analysis included only iPSCs and day-10 cardiomyocytes. Here, we will include all cell types in this dataset, namely iPSCs, and day-2, -4, and -10 cardiomyocytes.
# Retrieve sample metadata
sample.metadata <- marvel$sample.metadata

# iPSC
index <- which(sample.metadata$cell.type=="iPSC")
cell.ids.1 <- sample.metadata[index, "cell.id"]
length(cell.ids.1)
## [1] 11244
# Cardio day 2
index <- which(sample.metadata$cell.type=="Cardio day 2")
cell.ids.2 <- sample.metadata[index, "cell.id"]
length(cell.ids.2)
## [1] 6240
# Cardio day 4
index <- which(sample.metadata$cell.type=="Cardio day 4")
cell.ids.3 <- sample.metadata[index, "cell.id"]
length(cell.ids.3)
## [1] 8635
# Cardio day 10
index <- which(sample.metadata$cell.type=="Cardio day 10")
cell.ids.4 <- sample.metadata[index, "cell.id"]
length(cell.ids.4)
## [1] 5937
# Save into list
cell.group.list <- list("iPSC"=cell.ids.1,
                        "day2"=cell.ids.2,
                        "day4"=cell.ids.3,
                        "day10"=cell.ids.4
                        )

Gene expression profiling

marvel <- adhocGene.TabulateExpression.Gene.10x(MarvelObject=marvel,
                                                cell.group.list=cell.group.list,
                                                gene_short_name="TPM2",
                                                min.pct.cells=10,
                                                downsample=TRUE
                                                )
## [1] "Downsampling to 5937 cells per group"
marvel$adhocGene$Expression$Gene$Plot

marvel$adhocGene$Expression$Gene$Table
##   group mean.expr pct.cells.expr
## 1  iPSC  1.533693       93.39734
## 2  day2  2.064926       99.52838
## 3  day4  2.889787       99.89894
## 4 day10  2.919375       99.89894
  • min.pct.cells option. The minimum percentage of cells the gene is expressed in for the gene to be considered expressed and included for analysis here.
  • downsample option. If set to TRUE, the number of cells of each cell group will be down-sampled. The number of cells to down-sample to will be based on the size of smallest cell group.
  • We observe progressive increased in TPM2 gene expression from iPSCs to early cardiomyocytes and then late cardiomyocytes.

SJ usage profiling

marvel <- adhocGene.TabulateExpression.PSI.10x(MarvelObject=marvel,
                                               min.pct.cells=10
                                               )

marvel$adhocGene$Expression$PSI$Plot

head(marvel$adhocGene$Expression$PSI$Table)
##   group figure.column           coord.intron n.cells.total n.cells.expr.sj
## 1  iPSC          SJ-1 chr9:35682164:35684245          5937            5209
## 2  iPSC          SJ-2 chr9:35684316:35684487          5937            4860
## 3  iPSC          SJ-3 chr9:35684551:35684731          5937            2754
## 4  iPSC          SJ-4 chr9:35684808:35685268          5937            1013
## 5  iPSC          SJ-5 chr9:35685340:35685433          5937             919
## 6  iPSC          SJ-9 chr9:35684551:35685063          5937            1073
##   pct.cells.expr.sj sj.count.total gene.count.total   psi
## 1             87.74          15904            21778 73.03
## 2             81.86          12730            21778 58.45
## 3             46.39           4083            21778 18.75
## 4             17.06           1152            21778  5.29
## 5             15.48           1039            21778  4.77
## 6             18.07           1229            21778  5.64
  • min.pct.cells option. The minimum percentage of cells the splice junction is expressed in for the splice junction to be considered expressed and included for analysis here.
  • We observe 10 splice junctions to be expressed in at least one cell population.
  • SJ-1 is more highly expressed in iPSCs and early cardiomyocytes relative to late cardiomyocytes.
  • However, there is virtually no difference in SJ-2 expression across all cell populations.
  • On the other hand, SJ-3 is more highly expressed in cardiomyocytes relative to iPSCs.

Gene-splicing relationship

             Let’s investigate how the top 3 most highly-expressed splice junctions’ usage (SJ-1, -2, and -3) changes relative to TPM2 gene expression changes for all possible pair-wise comparison between iPSCs, and day-2, -4, and -10 cardiomyocytes.
# Perform differential gene expression analysis for all pair-wise group comparison
marvel <- adhocGene.DE.Gene.10x(MarvelObject=marvel)

# Perform differential splicing expression analysis for all pair-wise group comparison
marvel <- adhocGene.DE.PSI.10x(MarvelObject=marvel)

# Retrieve SJ DE results table
results <- marvel$adhocGene$DE$PSI$Data

# SJ-1
  # Define SJ
  coord.intron <- results[which(results$figure.column=="SJ-1"), "coord.intron"]
  coord.intron <- unique(coord.intron)
  coord.intron
## [1] "chr9:35682164:35684245"
  # Plot
  marvel <- adhocGene.PlotDEValues.10x(MarvelObject=marvel,
                                       coord.intron=coord.intron,
                                       log2fc.gene=0.5,
                                       delta.sj=5,
                                       label.size=2,
                                       point.size=2,
                                       xmin=-2.0,
                                       xmax=2.0,
                                       ymin=-25,
                                       ymax=25
                                       )

  plot.1 <- marvel$adhocGene$DE$VolcanoPlot$Plot

# SJ-2
  # Define SJ
  coord.intron <- results[which(results$figure.column=="SJ-2"), "coord.intron"]
  coord.intron <- unique(coord.intron)
  coord.intron
## [1] "chr9:35684316:35684487"
  # Plot
  marvel <- adhocGene.PlotDEValues.10x(MarvelObject=marvel,
                                       coord.intron=coord.intron,
                                       log2fc.gene=0.5,
                                       delta.sj=5,
                                       label.size=2,
                                       point.size=2,
                                       xmin=-2.0,
                                       xmax=2.0,
                                       ymin=-25,
                                       ymax=25
                                       )
                                         
  plot.2 <- marvel$adhocGene$DE$VolcanoPlot$Plot    

# SJ-3
    # Define SJ
    coord.intron <- results[which(results$figure.column=="SJ-3"), "coord.intron"]
    coord.intron <- unique(coord.intron)
    coord.intron
## [1] "chr9:35684551:35684731"
    # Plot
    marvel <- adhocGene.PlotDEValues.10x(MarvelObject=marvel,
                                         coord.intron=coord.intron,
                                         log2fc.gene=0.5,
                                         delta.sj=5,
                                         label.size=2,
                                         point.size=2,
                                         xmin=-2.0,
                                         xmax=2.0,
                                         ymin=-25,
                                         ymax=25
                                         )
                                         
    plot.3 <- marvel$adhocGene$DE$VolcanoPlot$Plot    

# Arrange and view plots
grid.arrange(plot.1, plot.2, plot.3,
             nrow=3
             )

  • log2fc.gene option. The absolute log2 fold change, above which, the gene is considered differentially expressed.
  • delta.sj option. The absolute difference in mean PSI values between two cell populations, above which, the splice junction is considered differentially spliced.
  • SJ-1 is more highly expressed in iPSCs and early cardiomyocytes relative to late cardiomyocytes.
  • However, there is virtually no difference in SJ-2 expression across all cell populations.
  • On the other hand, SJ-3 is more highly expressed in cardiomyocytes relative to iPSCs.

Locate SJ position

             Locating the position of the splice junctions on the isoforms corresponding to TPM2 will reveal the specific isoforms expressed in our dataset for this.
# SJ-1
    # Define SJs to plot
    coord.intron <- results[which(results$figure.column=="SJ-1"), "coord.intron"]
    coord.intron <- unique(coord.intron)
    coord.intron
## [1] "chr9:35682164:35684245"
    # Plot
    marvel <- adhocGene.PlotSJPosition.10x(MarvelObject=marvel,
                                          coord.intron=coord.intron,
                                          rescale_introns=FALSE,
                                          show.protein.coding.only=TRUE,
                                          anno.label.size=1.5
                                          )
## [1] "Retrieving transcripts from GTF file..."
## [1] "12 transcripts identified"
    marvel$adhocGene$SJPosition$Plot

# SJ-2
    # Define SJs to plot
    coord.intron <- results[which(results$figure.column=="SJ-2"), "coord.intron"]
    coord.intron <- unique(coord.intron)
    coord.intron
## [1] "chr9:35684316:35684487"
    # Plot
    marvel <- adhocGene.PlotSJPosition.10x(MarvelObject=marvel,
                                          coord.intron=coord.intron,
                                          rescale_introns=FALSE,
                                          show.protein.coding.only=TRUE,
                                          anno.label.size=1.5
                                          )
## [1] "Retrieving transcripts from GTF file..."
## [1] "12 transcripts identified"
    marvel$adhocGene$SJPosition$Plot

# SJ-3
    # Define SJs to plot
    coord.intron <- results[which(results$figure.column=="SJ-3"), "coord.intron"]
    coord.intron <- unique(coord.intron)
    coord.intron
## [1] "chr9:35684551:35684731"
    # Plot
    marvel <- adhocGene.PlotSJPosition.10x(MarvelObject=marvel,
                                          coord.intron=coord.intron,
                                          rescale_introns=FALSE,
                                          show.protein.coding.only=TRUE,
                                          anno.label.size=1.5
                                          )
## [1] "Retrieving transcripts from GTF file..."
## [1] "12 transcripts identified"
    marvel$adhocGene$SJPosition$Plot

  • rescale_introns option. If set to TRUE, the introns will be minimise in order to emphasise the exons and splice junctions.
  • show.protein.coding.only option. If set to TRUE, only protein-coding isoforms will be displayed.
  • We may infer that ENST00000329305 is the dominant isoform expressed in this dataset because it contains the 3 top highly expressed splice junctions.
  • Due to differential splice junction usage across the different cell population, it is conceivable that isoforms other than ENST00000329305 are also expressed in this dataset, and are differentially expressed across the different cell population.

Aggregate figures

             Finally, we may tabulate the earlier figures above into a publication ready format.